Narrowing the GAP: Enhancing gem5’s GPU Memory Bandwidth Accuracy

Xia, Yu; Ramadas, Vishnu; Poremba, Matthew; Sinclair, Matthew D (June 2025, 6th gem5 Users Workshop)

Computer systems research relies heavily on simulation tools like gem5 to prototype and validate new ideas. However, publicly available simulators struggle to model systems accurately as architectures evolve rapidly. This is a major issue because inaccurate simulator models may lead researchers to draw misleading or even incorrect conclusions about their research prototypes. Although this challenge pertains to many open-source simulators, we focus on the widely used, open-source gem5 simulator. In GAP, we showed that gem5’s GPGPU models have significant correlation issues versus real hardware. GAP also improved the fidelity of gem5’s AMDGPU model, particularly for cache access latencies and bandwidths. However, one critical issue remains: our microbenchmarks reveal an 88% error in memory bandwidth between gem5’s current model and corresponding real AMD GPUs. To narrow this gap, we examined recent patents and gem5’s memory system bottlenecks, then made several improvements: utilizing a redesigned HBM memory controller, enhancing TLB request coalescing, adding support for multiple page sizes, adding a page walk cache, and improving network bandwidth modeling. Collectively, these optimizations improve gem5’s GPU memory bandwidth by 3.8x, from 153 GB/s to 583 GB/s. Moreover, our address translation enhancements can be ported to other ISAs where similar support is needed, improving gem5’s MMU support.
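
As a quick sanity check on the headline figure, the following minimal sketch recomputes the improvement factor from the two bandwidth values quoted in the abstract. The function and variable names are illustrative assumptions, not part of gem5 or the paper's tooling.

```python
def speedup(before_gbps: float, after_gbps: float) -> float:
    """Return the improvement factor between two bandwidth measurements."""
    return after_gbps / before_gbps


# Values quoted in the abstract (GB/s); the names are hypothetical.
baseline_gbps = 153.0   # simulated bandwidth before the changes
improved_gbps = 583.0   # simulated bandwidth after the changes

ratio = speedup(baseline_gbps, improved_gbps)
print(f"improvement: {ratio:.2f}x")  # rounds to the 3.8x stated in the abstract
```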

Free, publicly-accessible full text available June 22, 2026

In recent years, deep neural networks (DNNs) have emerged as an important application domain driving the requirements for future systems. As DNNs get more sophisticated, their compute requirements and the datasets they are trained on continue to grow rapidly. For example, Gholami showed that compute in Transformer networks grew 750X over 2 years, while other work projects DNN compute and memory requirements to grow by 1.5X per year. Given their growing requirements and importance, heterogeneous systems often add machine learning (ML)-specific features (e.g., TensorCores) to improve their efficiency. However, given the size of ML workloads and their voracious rate of growth, it is increasingly challenging to perform early-system exploration based on sound simulation methodology. In this work, we discuss our efforts to enhance gem5’s support to make these workloads practical to run while retaining accuracy.
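
To put the two growth figures quoted above on a common footing, the sketch below converts the 750X-over-2-years figure to an equivalent per-year factor and compounds the 1.5X-per-year projection over the same 2-year window. It assumes simple geometric (compound) growth, which the abstract does not state; the helper names are hypothetical.

```python
def annualized(total_growth: float, years: float) -> float:
    """Per-year factor that compounds to total_growth over the given years."""
    return total_growth ** (1.0 / years)


def compounded(per_year: float, years: float) -> float:
    """Total growth after compounding a per-year factor over the given years."""
    return per_year ** years


# Figures quoted in the abstract; geometric growth is an assumption.
transformer_per_year = annualized(750.0, 2)   # roughly 27.4X per year
projected_two_years = compounded(1.5, 2)      # 2.25X over 2 years

print(f"750X over 2 years is about {transformer_per_year:.1f}X per year")
print(f"1.5X per year is {projected_two_years:.2f}X over 2 years")
```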